RFC: Missing values by Sentinels #9363
Conversation
Sentinels represent missing values (NA) by rarely used values within the domain of a datatype. E.g. NA_Bool is represented by the integer value `2`. We extend arithmetic and mathematical operations to respect the new semantics of these special values. Each operation is now accompanied by an `isNA` check.

Example
-------

```julia
julia> x = Option(1)
1

julia> x + 3.14
4.140000000000001

julia> typeof(x + 3.14)
Option{Float64}

julia> x = Array{Option{Int32}}([1, 2, NA, 4])
4-element Array{Option{Int32},1}:
  1
  2
 NA
  4

julia> x[4] = NA
NA

julia> x
4-element Array{Option{Int32},1}:
  1
  2
 NA
 NA

julia> mean(x)
NA

julia> mean(nonnull(x))
1.5
```
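As a rough sketch of what an `isNA`-guarded operation can look like — the names `NA_INT64` and `na_add` and the sentinel choice here are illustrative, not this PR's actual code:

```julia
# Sketch only: a sentinel-checked addition for Int64.
const NA_INT64 = typemin(Int64) + 1      # a rarely used value reserved to mean NA

isNA(x::Int64) = x == NA_INT64

# The native + is wrapped in an isNA check so that NA propagates through the result.
na_add(x::Int64, y::Int64) = (isNA(x) || isNA(y)) ? NA_INT64 : x + y
```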
Would be good to get @wesm to chime in here.
I think this would be a great package. We probably don't want to add another approach to missing data to Base right now. However, the cross-language interop aspect is interesting. The code is pretty good; I would say you've picked up the language quite well. The first issue I see is that converting both 1 and 2 to
Can you say more about your concerns here? When I say "every" I really mean these. These lists plus a tiny bit of meta-programming appear to go a long way. I guess my goal here is to make missing values syntactically invisible to the user in the common case. For all its faults, R does this well.
Happy to move this to a package. Still like to get feedback here if we can keep the PR open for a bit. Tips on how to get past some of the type issues would be welcome. I suspect I need to read up on Julia's abstract types to define some sort of
```julia
promote_rule{T<:Number, S<:Number}(::Type{T}, ::Type{Option{S}}) =
    Option{promote_rule(T, S)}
promote_rule{T<:Number, S<:Number}(::Type{Option{T}}, ::Type{S}) =
    Option{promote_rule(T, S)}
```
Is `promote_rule` commutative?
Typically you only need to define it in one direction.
`promote_rule` isn't, but `promote_type` uses `typejoin(promote_rule(S,T), promote_rule(T,S))`, so it is guaranteed to be symmetric even though `promote_rule` isn't.
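For illustration, a sketch of why one direction suffices, using a hypothetical `Wrapped` type and the same 0.4-era syntax as the diff above (not this PR's `Option` code):

```julia
immutable Wrapped{T<:Number} <: Number
    value::T
end

# Only one direction is defined...
Base.promote_rule{T<:Number, S<:Number}(::Type{Wrapped{T}}, ::Type{S}) =
    Wrapped{promote_type(T, S)}

# ...yet both argument orders agree, because promote_type consults
# promote_rule(A, B) and promote_rule(B, A) and joins the results.
promote_type(Wrapped{Int64}, Float64)   # Wrapped{Float64}
promote_type(Float64, Wrapped{Int64})   # Wrapped{Float64}
```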
FWIW, I've actually stopped believing that sentinels are the correct universal solution for representing missing data. It might make sense to have a compatibility layer in Julia for receiving/emitting data with missing values encoded in a particular way (for interactions with R/Python/etc.), but the kicker in my view is that it renders you not 100% compatible with database systems (which use an extra bit or byte to represent nullity, and thus are not burdened by missing bit patterns in each of their data types, e.g.).

Unfortunately, Python-land is still a bit crippled in this regard, with pandas's hand-wavy support for missing data (a legacy issue of course; we made the best decisions we could under the circumstances and still ship a piece of software that people will adopt broadly) and lack thereof in NumPy.

Happy to speak some more about this sometime; I should write a more well-reasoned and -written blog post ;)
Thanks, @wesm, for chiming in. Much appreciated!
I think it would be good for this to have the same API as the Nullable type in Base. Then the user can choose whichever approach is more appropriate for his or her application.
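A sketch of what sharing that API could look like, using the 0.4-era `isnull`/`get` interface that was current when this thread was written; `Option` and `NA_INT64` here are illustrative stand-ins, not this PR's code:

```julia
# Sketch: a sentinel-backed wrapper exposing the same surface as Nullable.
immutable Option{T<:Number} <: Number
    value::T
end

const NA_INT64 = typemin(Int64) + 1   # illustrative Int64 sentinel

Base.isnull(x::Option{Int64}) = x.value == NA_INT64
Base.get(x::Option{Int64}) = isnull(x) ? throw(NullException()) : x.value
Base.get(x::Option{Int64}, default) = isnull(x) ? default : x.value
```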
I don't have strong thoughts one way or the other on how to represent missing values. However, I will go ahead and champion the sentinel choice for the sake of argument.

In almost all cases finding a spare value in a data type's range is free. IEEE floating point has 53 bits in NaN to play with, Bool has 256 - 2 options, and signed integers generally have one more negative number than positive number (e.g. int8 goes from -128 to +127). The only troublesome case seems to be, as Wes points out,

In terms of database compatibility, it's rare for me to see user code interact directly with database storage systems. Cross-language operation with R seems more common/important to me; my personal experience is a pretty biased sample though.

Moving back a bit from how one represents null values in memory, one might consider just how users interact with them. I think that naive

Mostly though, I think that the main reason to go with sentinels isn't about performance, it's about seamless compatibility with legacy R code.
There are two major issues with using sentinels – performance and lack of composability. Floats do happen to work reasonably well using NaNs as NA (although then you have weird crap like whether NaN + NA is NaN or NA depending on argument order) – this is the case in R:

```r
> NaN + NA
[1] NaN
> NA + NaN
[1] NA
>
```

For integers, it's worse since the native operations do random stuff to sentinel values, so you have to either put checks around all your accesses or do the work normally and then patch it up at the end (but this is what you do for the case of keeping NA flags separately anyway).

The composability issue is a much more intractable one. How do you support arbitrary data types that can be NA? There's no good way to do it (that I've ever seen).
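A sketch of the "do the work normally, then patch it up" strategy for integers — the sentinel value and helper names are illustrative, not from this PR:

```julia
const SENTINEL = typemin(Int64) + 1   # illustrative Int64 sentinel

function na_add!(w::Vector{Int64}, u::Vector{Int64}, v::Vector{Int64})
    for i in 1:length(u)
        w[i] = u[i] + v[i]            # native add; NA slots now hold garbage
    end
    for i in 1:length(u)
        if u[i] == SENTINEL || v[i] == SENTINEL
            w[i] = SENTINEL           # patch the corrupted slots back to NA
        end
    end
    return w
end
```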
Yeah, for arbitrary types things do become unclean. The null byte addition is more easily generalizable. Solutions in the sentinel case:
How does performance differ between sentinels and other solutions to represent nullity? There must be a check at some point. I can imagine masked array solutions like
The performance advantage of using an external array to store a NA mask is that you can use BitVectors for them and then compute the combined NA mask of the two arguments cheaply. Is that faster than `w[i] = ifelse((u[i] == SENTINEL) | (v[i] == SENTINEL), SENTINEL, u[i] + v[i])`? I don't know. I also don't know that the version with packed bitmasks is faster, but I'm betting it is.
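For contrast with the sentinel `ifelse` line above, a sketch of the bitmask variant (hypothetical names): the masks are combined in one pass over the BitVectors and the values are added unconditionally.

```julia
# Sketch: values and missingness stored separately. NA slots in w hold garbage
# but are covered by the combined mask.
function masked_add(u::Vector{Float64}, una::BitVector,
                    v::Vector{Float64}, vna::BitVector)
    wna = map(|, una, vna)   # combined NA mask of the two arguments
    w = u .+ v               # plain vectorized add
    return w, wna
end
```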
That matches my intuition as well. I'm not generally comparing scalar sentinels to arrays with separate bitmask arrays. I'm mostly comparing them to things like
We wouldn't be storing things as arrays of Nullables, or at least that's not the plan, I don't think.
Yes, I think an alternative API which would be more tolerant of missing values would be very useful for many people. When you know for sure that (many) missing values are present in your data, you really don't want to get errors for them, or you'll spend your day handling errors that clutter your code. As Terry Therneau says, "R succeeded because it's useful", and we need to make Julia as useful as (and more useful than) R as regards missing values for it to succeed with that user segment.
How you represent that a value is missing is orthogonal to how easy it is to use. The main issue is that in Julia
A couple of points to add to the discussion:

(1) It's very important that any API work with more than just primitive types. If you want Julia to interact with Hive using the same API that we use to interact with SQLite3, we can't allow any divide to exist between primitive types and more complex types like Array columns and Map columns.

(2) Whether user code interacts with databases directly depends on who you think of as a typical user. I would bet you're probably thinking of scientists. But I'm thinking of people who build websites. Those people are already using a bunch of languages that have constructs like

(3) I'm not convinced that interop with R is very important. Many of the classic R packages are actually wrappers around C or Fortran code that doesn't care about R's sentinel semantics. This is true, for example, of the glmnet package -- which unconditionally prohibits data from containing any missing values. I believe that we need interop with the C code from those packages, but not necessarily with the R wrappers around those C libraries.

(4) Allowing

(5) I'm generally not that worried about things feeling "complicated" to users. I think that many definitions of "complicated" ignore the problem that a language that seems "complicated" on day 1 may seem "simple" on day 10, whereas a language that seems "simple" on day 1 may seem "complicated" on day 10. This is my exact experience with both R and Perl: the languages seem more and more complicated as I get to know them better. In contrast, the definition of

(6) I think there's not a lot to be gained from saying that "R succeeded because it's useful" without causal evidence to support those kinds of conclusions. R might well have succeeded because of randomness in how programming communities form.
@StefanKarpinski Yes, I think that's what @mrocklin meant in the first sentence I quoted. But I'm not sure about the type instability: as long as operations with one nullable always give a nullable, everything should be stable.
The problem with silent propagation of
I agree with this. However, I do think that we should have types like the ones in this PR that provide binary compatibility with R while presenting the same interface as Nullable. That way we can do zero-copy interaction with R and also interact smoothly with databases, etc. This is not necessarily an either-or kind of decision. But we do need to settle on a common interface.
@johnmyleswhite I think we have to agree to disagree on the question of silent
I agree that the important part is the API, not the implementation. But they're unfortunately somewhat intertwined. For example, if we want to support the semantics that Evan Miller advocated in his recent blog post, then we need to ensure that every
I've been meaning to read that post (and reply to the email you sent me about it). I will do so and reply here :-)
I figured this would be an effective way to goad you into doing so.
@nalimilan If you write up some code that implements "an easy way to skip missing values", I'm happy to merge it into DataFrames or any other JuliaStats package. But I suspect you'll find it's even harder than adding sane tools for working with rounding modes to Base Julia.
@johnmyleswhite As I said elsewhere, I suspect this will in the end require a separate type whose behavior would be to have

As regards Evan Miller's approach, it could be interesting to try, but I don't see how allocating a vector of weights could be practical or efficient. What could be done is multiplying values by

But I don't get the interest of his two-layer idea: a way of viewing the same data with two different bitmasks might be useful, but it doesn't need to be supported by default by
I believe Evan's vision is that the memory cost of frequency weights is low, but that they allow you to write branch-free, SIMD-friendly code. I agree that the easiest way to handle the two kinds of missingness problem would be to use two layers.

Regarding multiple kinds of missingness, I think that I'm not involved with the survey world enough to see the value in N kinds of missingness, given that I can easily imagine needing N + 1 kinds of missingness -- which I would tend to handle via categorical variables and some well-placed calls to
Re: Evan Miller's approach, I'm not sure the frequency weights would have to be anything besides an AbstractArray that wraps the NA bitmask. I haven't benchmarked it, but LLVM can (mostly) vectorize:

```julia
immutable BitWrapper
    x::BitVector
end

# Indexing returns true where the element is present (the NA bit is 0),
# so the wrapper acts as 0/1 frequency weights over the data.
Base.getindex(x::BitWrapper, i) = !Base.unsafe_bitgetindex(x.x.chunks, i)

# Sum of the non-missing elements of x: multiplying by the 0/1 weight
# avoids a branch in the inner loop.
function f(x::Vector{Float64}, y::BitWrapper)
    z = zero(eltype(x))
    @simd for i = 1:length(x)
        @inbounds z += y[i]*x[i]
    end
    z
end
```

Note that because of the way we define
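A hypothetical usage of the sketch above, summing the non-missing entries of a small vector:

```julia
x = [1.0, 2.0, 3.0, 4.0]
na = falses(4); na[2] = true     # BitVector NA mask: element 2 is missing
f(x, BitWrapper(na))             # 8.0 == 1.0 + 3.0 + 4.0
```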
@johnmyleswhite I don't think that's an absolutely required feature for handling survey data, as can be seen from the fact that R does not support this and still is relatively successful in that area. But the idea is that you want to preserve the information about the cause of missingness (which comes as a level of a categorical variable in data files provided by survey institutes), and yet skip/hide these values by default in frequency tables, regressions, etc. Here you see that the question of the default behavior is considered important by software in this field. Then it also offers a way of distinguishing several types of missingness for numeric variables. This could certainly be handled via a separate categorical variable, but for consistency software generally supports the same pattern as for categorical variables.

@simonster Yeah, that's what I meant.
@nalimilan It seems easiest to handle that problem using a

We can certainly treat the bit array as degenerate frequency weights, but, for the more general case of statistical modeling, we need to actually support real frequency weights. This is something that R does somewhat haphazardly, which makes working with massive data sets quite difficult -- you can usually easily fit a weighted model, but it's much harder to get statistics back out that correctly reflect the weighting.
Well, I'm not sure what you mean exactly, but that's really not my number one priority. I'd say better get one single type of missingness working first.
Agreed, we really need something more consistent than R. I was thinking one of the columns of a

In that scheme, missingness would be completely orthogonal to weights, except that the algorithm could multiply the weights by the
I mean that you can have something like

I'm not really fond of
I did some benchmarks for simple Float64 summation skipping missing values in JuliaStats/DataArrays.jl#133. The benchmarks indicate that memory access is the bottleneck. The NaN sentinel approach has basically zero overhead, whereas with missingness represented as a Vector{Bool} the benchmark takes about 9/8 as long, and with a Vector{Float64} it takes about 2x as long. The BitVector approach currently used by DataArrays is many times slower than sentinels without some work, but throwing a lot of tweaks at it makes it ~25% slower than the sentinel approach (not much slower than a Vector{Bool}), and even then I don't think we've hit the limit of what the hardware could do since LLVM isn't vectorizing the loads. Unfortunately it takes some work to get that performance with BitVectors, whereas Vector{Bool} is fast "by default."
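For reference, a sketch of the two simplest kernels being compared here (illustrative only; the actual benchmark code is in JuliaStats/DataArrays.jl#133):

```julia
# NaN-sentinel version: missingness is encoded in the data itself.
function sum_sentinel(x::Vector{Float64})
    s = 0.0
    @simd for i in 1:length(x)
        @inbounds s += ifelse(x[i] == x[i], x[i], 0.0)   # x[i] == x[i] is false only for NaN
    end
    s
end

# Vector{Bool} mask version: missingness is stored out of band.
function sum_masked(x::Vector{Float64}, na::Vector{Bool})
    s = 0.0
    @simd for i in 1:length(x)
        @inbounds s += ifelse(na[i], 0.0, x[i])
    end
    s
end
```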
Dot product or similar operations might have different results.
@mrocklin A dot product is twice as many operations and accesses twice as much memory, so I'd expect it to have roughly similar performance characteristics. At least for Vector{Bool} vs. the sentinel approach, I can confirm that this is true; Vector{Bool} adds about ~15% overhead in both cases.
Sorry, by dot product I meant matrix-matrix multiplication or more generally tensor-tensor contraction or more generally, something with higher FLOP/byte intensity.
@simonster I was wondering about the performance implications of the different methods. Thanks.
@mrocklin Is matrix multiplication really a useful operation when some elements are missing? The missing elements are going to poison a large part of the result.
If missing values are to be propagated, determining which rows/columns are corrupted is O(n^2) whereas the multiplication itself is O(n^3), so asymptotically there should be no difference between methods. If missing values are to be ignored, then the complexity of matrix multiplication is still the same as for the n^2 individual O(n) dot products, but the memory access pattern is different.

Actually establishing that any differences are due to storage and not implementation might be difficult: my impression is that a lot of work goes into optimizing
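A sketch of that O(n^2) pre-pass for the propagating case — the names are hypothetical, and missingness is carried as a separate Matrix{Bool}:

```julia
# Multiply normally, then mark every output cell whose row of A or column of B
# contained a missing value. The pre-pass is O(n^2); the multiply stays O(n^3).
function na_matmul(A::Matrix{Float64}, Ana::Matrix{Bool},
                   B::Matrix{Float64}, Bna::Matrix{Bool})
    badrow = [any(Ana[i, :]) for i in 1:size(A, 1)]
    badcol = [any(Bna[:, j]) for j in 1:size(B, 2)]
    C = A * B
    Cna = [badrow[i] || badcol[j] for i in 1:size(A, 1), j in 1:size(B, 2)]
    return C, Cna
end
```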
I think Simon's analysis is dead on. It would be great to get a set of benchmarks that describe which operations we want to make fast. Computing something like

That said, I think the goal of this PR wasn't performance: I think it was having missing data stored in a format that can be passed to R. I think that is something worth doing well (and this PR is a great start), but I'd prefer to view the R sentinel values format as a serialization format that we can spit out, rather than our native data format.

Another question raised by Milan is where you draw the line and stop implementing functions on
I was just trying to find an example proxy operation that has higher compute intensity. Judging performance on linear algorithms is likely to be just a judgment on compactness in memory.
My intuition is that, with higher compute intensity, the method of encoding missing values should matter less. At some point it becomes fastest just to branch on missingness.
@mrocklin I agree that we need to test more operations, but I think it's mostly about access patterns, e.g., summing matrix rows with BitArray missingness totally blows. Feel free to give input (or code) for representative benchmarks at JuliaStats/DataArrays.jl#133.
I think we should close this. There was quite a lot of good discussion, but it seems that actual development has gone in the direction of
Closing seems appropriate to me. Thanks for the chat.
Here is a draft for missing value support by sentinels. This is my first real go at Julia so I suspect I'm doing many things badly. Still, hopefully the general design is sound. More on missing bits at the end of this comment. Work was done with @dchudz.
Sentinels represent missing values (NA) by rarely used values within the domain of a datatype. In particular we use the following values for the following types:

- `Bool`: the integer 2
- `Float32/64`: the exponent signalling `NaN` and the mantissa 1954 (the birth year of an R creator)
- `Int32/64`: the minimum integer, -2**31/63 + 1
- `Complex`: the float definition of `NA` for both real and imaginary parts

We extend arithmetic and mathematical operations to respect the new semantics of these special values. Each operation is now accompanied by an `isNA` check.

The choice of sentinel values here agrees with the choices made by the R language and subsequently by the DyND project in Python (cc @mwiebe) (article on R sentinels).
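For concreteness, the Float64 sentinel can be spelled out by its bit pattern — a sketch, with the constant name `NA_FLOAT64` being illustrative rather than from this PR:

```julia
# A NaN whose mantissa carries the payload 1954, matching the R convention
# described above.
const NA_FLOAT64 = reinterpret(Float64, 0x7ff00000000007a2)   # 0x7a2 == 1954

isnan(NA_FLOAT64)                          # true: ordinary float code sees a NaN
reinterpret(UInt64, NA_FLOAT64) & 0xffff   # 0x07a2 == 1954, the distinguishing payload
```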
Adopting this as a cross-language standard for missing values might be of some use.
Example

```julia
julia> x = Option(1)
1

julia> x + 3.14
4.140000000000001

julia> typeof(x + 3.14)
Option{Float64}

julia> x = Array{Option{Int32}}([1, 2, NA, 4])
4-element Array{Option{Int32},1}:
  1
  2
 NA
  4

julia> x[4] = NA
NA

julia> mean(x)
NA

julia> mean(nonnull(x))
1.5
```
Issues
I have no idea if this is an appropriate solution for missing values in Julia.
`DataArray` (masked arrays) and `Nullable` (explicitly boxed) are two other fine solutions in various contexts. This is just a third.

If this is an appropriate solution then certainly my implementation could use a lot of work. In particular:

- `Number`: we really mean `{Int32, Int64, Float32, Float64, ...}`, all of the Optionable types.
- `Option` does not extend to non-primitive numeric types.
- `NA` could use better interaction with promotion rules and such.

Also, presumably people have discussed this before. I'm coming into this without much context. My apologies.